ABSTRACT


The main problem of interest in this paper is the "role
definition problem" arising in one-on-one dogfight game
models. The computational approach is aimed at providing a
decomposition of the space of game initial conditions into
sets of unilateral capture capability for each of the players,
and at outlining the draw and sacrifice sets in accordance
with the players' individual preferences for game outcomes.
The procedure develops the feedback policy (in terms
of the observable data) that attains the above decomposition.
Two highly simplified one-on-one games are considered. The
first game model is a discrete time-state alternating move
game (perfect information) on a horizontal grid reminiscent
of the Isaacs examples. The second model is a continuous
time-regional feedback game (imperfect information) in the
horizontal plane. The strategy synthesis is effected by a
"reinforcement learning" procedure in both game models.
Computational results are given in some detail for the first
game, while preliminary results are presented for the second
game model.
INTRODUCTION


One of the more difficult areas for applications-oriented
workers in the field of modern optimal control theory
continues to be the one-on-one aerial dogfight problem. We
believe that these difficulties are due in part to
the fact that the one-on-one dogfight problem is perhaps
more accurately modeled as a "qualitative" differential game,
as contrasted with the "quantitative" game model. Briefly,
the qualitative game contains two or more events dealing
with termination of play, for which the players have some
preferential ordering. In the quantitative game, by contrast,
real valued payoff functions defined on the trajectory and/or
terminal data can be unequivocally assumed as goals for each
player. The Isaacs "homicidal chauffeur game" and "game of
two cars" (Ref. 1) are pursuit games of the latter type. In
these, the roles of pursuer and evader are clear at the outset,
and the players seek to minimize and maximize the capture
time, respectively. Dogfight game models do not come equipped
a priori with the pursuer and evader roles defined; in fact,
these role definitions must be determined in the course of
obtaining a resolution of these games.

The approach taken here is a small step in the direction
of trying to resolve these dogfight game models. By resolution,
we mean to decompose the space of game initial conditions
into sets of unilateral capture capability for each
player and to outline the sacrifice and draw sets in accordance
with the players' individual preferences for game outcomes,
and furthermore to derive the associated strategies
(providing the decomposition) as feedback control policies
on the collection of observable data. Two highly simplified
game models are considered in the text. The first is a discrete
time-state game with an alternating move structure.
The second model is a continuous time-state game model
employing "regional" feedback policies. In the case of the
first model, "perfect information" regarding the "state" at
each player's control decision has been assumed. A resolution
of that game model for specified dynamics, control capabilities,
weapons envelopes, and player preferences is obtained
by two procedures. The first procedure is similar to
that employed by Isaacs (dynamic programming) in the homicidal
chauffeur game, but with some modification to observe
the stipulated preference descriptions of the dogfight instead
of the min max capture time criteria of the chauffeur
game. The second procedure employs a "reinforcement rule"
algorithm used in conjunction with the simulation of game
plays. The second procedure offers the conceptual facility
for immediate extension to the more complex problem presented
by the second model. The second model, as constructed,
does not have a predetermined move structure (simultaneous
or alternating); instead, the control reevaluation points
on a time scale are implicitly determined by the traversing
of "regional" boundaries in the observables during the course
of play. This "imperfect information" model is similar in
many respects to the one constructed in Refs. 2 and 3, which
take a "controllable" Markov chain approach to pursuit-evasion
problems. The text will outline a "reinforcement rule"
procedure to be applied in these models, as
originally described in Ref. 4, and present some preliminary
computational results for specific model data.


DISCRETE  TIME-STATE  DOGFIGHT  GAME 


Game  Model:  Description  of  State,  Lethal  Envelopes 


The state relative to Player I is given by the triple
(n,m,p). The admissible control choices for any (n,m,p)
are u_1, u_2, u_3 for Player I (see Fig. 1) and
v_1, v_2, v_3 for Player II (see Fig. 2). We assume the game
to have an alternating move structure. The one step transition
equations for a move by Player I are


[Equation illegible in source: one-step update of (n, m, p) from
stage K to stage K+1 under Player I's control u, with terms in
n/k_1, (n+m)/k_1, and 1/k_1.]

If p = ±3 (see Fig. 3) and if u = u_3, set p = -3; or if u =
[illegible], set p = [illegible].


The one step transition equations for a move by Player
II are
In the above, q = p + 2 and

\[
\varepsilon(x) =
\begin{cases}
+1 & \text{if } x = +1, +2 \\
-1 & \text{if } x = -1, -2, +4, +5 \\
\phantom{+}0 & \text{if } x = 0, \pm 3, +6
\end{cases}
\]

If p = ±3 and if v = v_2 or v_3, set p = -3; and
if v = v_1, set p = +3. The quantities u and v are
interpreted as follows:

k_1 and k_2 are step sizes in the grid with which the
players move and are representative of the velocities with
which grid points can be traversed.


Game  Outcome  Description 


In general, for this game only one of four possible outcomes
can result from a play of a game beginning from any
(n,m,p). The outcomes are:

C_I     capture by Player I

C_II    capture by Player II

S       sacrifice (mutual capture)

D       draw

Note: We have assumed that first "passage" to any of
the outcomes C_I, C_II, or S terminates play.

On  the  basis  of  the  lethal  envelopes  illustrated  in 
Figs.  1  and  2,  the  sets  become: 
dealing with termination can be described in terms of the
(n,m,p) coordinates.


Move  Structure  and  Information  Pattern 

We have postulated an alternating move structure in this
discrete game. Therefore, the move structure and information
patterns fall into one of the two game structures shown in
Fig. 4, where the argument of x(·) and u(·), v(·) is the
time unit.

We assume the move structure and information pattern of
Game I (i.e., Player I moves first) in subsequent discussion.
The game move structure is interpreted as follows: the game
begins at x(0) (coordinates n,m,p). Player I has complete
information, that is, knowledge of (n,m,p) at the time he
makes control decision u(0). The game state is advanced via
the transition equations to state x(1), at which point
Player II, having data x(1), selects decision v(1), and
so on, until a termination occurs. At this point, we require
a stopping time parameter, T, from which a draw termination
can be decided in a fixed number of stages of play.


Strategies 


The strategies for this game are the functions ζ and η,
where:

for Player I:    ζ : x(N) → u(N)

for Player II:   η : x(N) → v(N)

Hence, ζ is a mapping from all x(N) to an admissible u
(likewise for η and v), and the totality of all ζ (η)
constitutes the strategy spaces. N is the index of time (or
stage) of play. In our algorithm we utilize behavior strategies,
in which a probability distribution over the admissible
controls is stipulated at each state; the actual choice of move
made at x(N) is then accomplished by sampling from the
stipulated distribution.
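A minimal sketch of sampling a move from a behavior strategy is given below. The table layout (a dictionary keyed by a state index, holding one probability per control) and the names `strategy_I` and `sample_control` are illustrative assumptions, not taken from the original program.

```python
import random

# Hypothetical behavior-strategy table for Player I: one distribution over
# (u1, u2, u3) per state index. Initially all controls are equally likely.
strategy_I = {
    0: {"u1": 1 / 3, "u2": 1 / 3, "u3": 1 / 3},
    1: {"u1": 1.0, "u2": 0.0, "u3": 0.0},  # a converged (pure) row
}


def sample_control(distribution, rng=random):
    """Draw one control according to the stipulated distribution."""
    controls = list(distribution)
    weights = [distribution[c] for c in controls]
    return rng.choices(controls, weights=weights, k=1)[0]


move = sample_control(strategy_I[0], random.Random(1))
```

A row that has converged to a pure strategy, such as state 1 above, always returns the same control.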


Outcome  Preferences 

In line with our treatment of dogfight games as qualitative
games, there exists a preference for outcomes C_I,
C_II, S, and D on the part of each of the players. For
this example, a typical preference ordering might be given
as:

Player I                         Player II

C_I preferred to D, S, C_II      C_II preferred to D, S, C_I
D preferred to S, C_II           D preferred to S, C_I
S preferred to C_II              S preferred to C_I

Computational  Approach  Using  Reinforcement  Rule  Logic 

Model  Assumptions  Made  for  Computational  Expediency 

•  Truncation of the game state to a finite collection.
The truncation is such that the region
shown by the shading in Fig. 5 represents the
finite collection of states, while the region
exterior to it constitutes termination as a draw
outcome. In realistic models, this boundary
would be representative of those relative range
values at which visual or other contact could
not be made. In our model, therefore, we consider
that any path, even though it starts in
the interior, is terminated as a draw upon
reaching the exterior.

•  Introduce a fixed termination time that terminates
all paths as draw outcomes beyond the fixed time.
This time is a parameter of the model and can be
varied to examine the solution's dependence on the
values of this parameter.

•  Strategies are functions of the current state only,
and not of time (or time-to-go) and state.


The  Simulation  Process 
•  Data

1)  Indexing of the finite state collection 1, ..., N.

2)  Dynamical system: one stage reachable set description
given for Players I and II.

3)  Classification of outcomes: sets C_I, C_II,
S, in terms of weapons system descriptors
A_I and A_II.

4)  Termination time specified: T.

5)  Probability distributions on control choices
initially equally likely for all states for
both players.

6)  Subjective reinforcement rule weightings assigned
to outcomes Ω = {C_I, C_II, S, D} in
accordance with given orderings: weightings
μ(Ω) for Player I; ν(Ω) for Player II.
Obtaining A Run


1)  An initial game state is selected. A random
number generator is consulted for determination
of control choice. The sampling is done
in accordance with the probability distributions
currently used by that player for that
state. Hence, a pair of state-control sequences
is generated:

x(0), u(0) ; x(2), u(2) ; ...

x(1), v(1) ; x(3), v(3) ; ...

These data are temporarily stored. An outcome
is observed, say C_I; the run is then
terminated. Assume the arbitrary weights

μ(C_I)  = 2.00      ν(C_II) = 2.00
μ(D)    = 1.00      ν(D)    = 1.00
μ(S)    = 0.99      ν(S)    = 0.99
μ(C_II) = 0.50      ν(C_I)  = 0.50

have been assigned. (These weights are in
accordance with the example ordering given
earlier.)

2)  The reinforcement process is conducted as
follows:

For Player I:  Assume state x_i visited, u_1
chosen by I at x_i. Hence
the distribution at x_i is
altered from (1/3, 1/3, 1/3) to
(1/2, 1/4, 1/4).

For Player II: Assume state x_j visited, v_2
chosen by II at x_j. Hence
the distribution at x_j is
altered from (1/3, 1/3, 1/3) to
(2/5, 1/5, 2/5).

This procedure is repeated for states
visited during that run by both players.

Note: This is an arbitrary procedure; other
possibilities exist, one being to alter
the distributions nearer termination more
than those nearer the start of that run.
This is a point for further investigation and
is incorporated in the continuous model.

Hence for the procedure described we change
the distributions in the following way: Let
n_1(x_i), n_2(x_i), n_3(x_i) represent nonnegative
entries for Player I associated with state
x_i. Initially, n_1 = n_2 = n_3; hence

\[
\mathrm{Prob}[u(x_i) = u_k] = \frac{n_k(x_i)}{\sum_{j=1}^{3} n_j(x_i)} = \frac{1}{3}
\]

As we have assumed that C_I was the termination,
the new entries become

μ(C_I) n_1(x_i), n_2(x_i), n_3(x_i),

since u_1 was utilized by Player I when x_i
was the current state. These quantities are
then normalized and used as new data for obtaining
the next run of the simulation. (A
similar procedure is carried out for Player
II.)
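The update just described can be sketched in a few lines. The function name `reinforce` and the list representation of the entries are illustrative assumptions; the weights are the example values given above.

```python
def reinforce(entries, chosen, weight):
    """Multiply the entry of the control actually used by the outcome
    weight, then normalize so the entries again sum to one."""
    updated = list(entries)
    updated[chosen] *= weight
    total = sum(updated)
    return [e / total for e in updated]


uniform = [1 / 3, 1 / 3, 1 / 3]

# Player I used u_1 (index 0) and the run ended in C_I, weight mu(C_I) = 2.00.
row_I = reinforce(uniform, chosen=0, weight=2.00)

# Player II used v_2 (index 1) in the same run, weight nu(C_I) = 0.50.
row_II = reinforce(uniform, chosen=1, weight=0.50)
```

With these inputs the rows move to (1/2, 1/4, 1/4) and (2/5, 1/5, 2/5), the two distributions quoted in the example.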


At this point in time, our experience with the above
model is not sufficient to disclose the most efficient sampling
procedure over the game starting conditions, nor the most
efficient reinforcement rule logic. However, our experience
has shown that building from short duration games from starting
points close to termination outward to longer duration
games from more distant starting points (similar to dynamic
programming) is a preferred procedure with the reinforcement
rule mentioned.


The  Markov  Chain  Models 

As our interest in these problems is to obtain a decomposition
of the game starting conditions into sets for which
Player I can capture Player II, II can capture I, and sets
of mutual capture, according to the players' respective outcome
preferences, the following Markov chain model proves
useful:

•  The transition operators of the Markov chains are
described first. We assume that a sufficient number
of runs have been made in the simulation process
and that two families of stable distributions
representing the strategies for Players I and II
over the x_i have been obtained.

For Player I we can then form the operator P, an
(N+1) × (N+1) matrix whose rows and columns are
indexed by x_1, ..., x_N, x_{N+1}. Row x_i carries
the entries p_ij, p_ik, p_il in columns x_j, x_k, x_l,
and the last row, x_{N+1}, is (0, 0, ..., 0, 1), where

p_ij = Prob[x(K+1) = x_j | x(K) = x_i].

In the above, we have let

x_1 = {x | x ∈ C_I}
x_2 = {x | x ∈ C_II}
x_3 = {x | x ∈ S}.

We then require p_11 = 1, p_22 = 1, p_33 = 1 by our
first passage assumption. The entries for an arbitrary
row (p_ij, p_ik, p_il) are obtained from
two sources: 1) the numerical value p_ij from
the converged distributions in the strategy table
for the corresponding control choice; and 2) the
location j, k, l from the one-step reachable set
properties of the dynamical system of Player I.

The state space truncation to a finite collection
N with termination as draw outside this collection
is treated by the additional state x_{N+1}
with the property that

p_{N+1,N+1} = 1.0.

A similar construction is used to obtain an operator
Q for Player II analogous to P for Player
I.


Given the operators P and Q, we can now compute
the following conditional probability of entrance:

Prob[x(K) = x_1, x(ν) ∉ x_2, x_3 | x(0) = x_i]

where

0 < K ≤ T
0 ≤ ν < K

and where T is the stopping time parameter.

Hence, we have the probability that play will first
terminate in x_1 in T stages or less, given that
play began at x(0) = x_i. These data are obtained
in the first column of the matrix [PQ]^T in game I
(Player I moving first) and in [QP]^T in game II
(Player II moving first). The second column
signifies termination in C_II, the third column
in S. These probability data serve to provide
the decomposition sought.
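The computation of [PQ]^T can be illustrated on a toy chain. The five-state layout below (three absorbing termination classes, one interior state, and the draw state) and all of its transition probabilities are invented for illustration only; they are not the operators of the model described here.

```python
def matmul(a, b):
    """Plain-Python product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def entrance_probabilities(P, Q, stages):
    """Return [PQ]^stages; column 0 is entrance into x_1 (C_I),
    column 1 into x_2 (C_II), column 2 into x_3 (S)."""
    one_round = matmul(P, Q)  # game I: Player I moves, then Player II
    n = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(stages):
        result = matmul(result, one_round)
    return result


# States: 0 = x_1 (C_I), 1 = x_2 (C_II), 2 = x_3 (S), 3 = interior, 4 = draw.
absorbing = lambda i: [1.0 if j == i else 0.0 for j in range(5)]
P = [absorbing(0), absorbing(1), absorbing(2),
     [0.5, 0.0, 0.0, 0.5, 0.0], absorbing(4)]   # I captures w.p. 0.5
Q = [absorbing(0), absorbing(1), absorbing(2),
     [0.0, 0.5, 0.0, 0.5, 0.0], absorbing(4)]   # II captures w.p. 0.5

M = entrance_probabilities(P, Q, stages=2)
p_capture_by_I = M[3][0]
p_capture_by_II = M[3][1]
```

Because the termination classes are absorbing, the rows of the power accumulate first-passage probability rather than losing it at later stages.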


Computational  Results 

Figures 6-9 show the decomposition obtained for game I
with k_1 = 2, k_2 = 1, and T = 10 moves for each player, the
dynamics, lethal envelopes, and player preferences all being
assumed as outlined in the previous discussion. The plots
for p = -1 and p = -2 are not shown, as these data are available
from their symmetrical counterparts p = +1 and
p = +2, respectively.

Note: One finds that all strategies are pure strategies in
the converged results, as might have been expected from the
alternating move, perfect information structure of the problem.
The detailed listing of the associated strategies for
both players making up the decomposition is not given because
of space considerations.


Computer  Considerations 

The above described procedure was programmed for use on
an IBM 360/75 computer. The model was composed of 2166
states ((n,m,p) triples). By means of equivalence class
reductions in the terminations of type C_I, C_II, S, the
resulting state collection was reduced to N = 2046 (symmetry
conditions could have reduced this figure by nearly half). A
total of 50,000 runs (plays) were made in arriving at the
strategy distributions. This required 20 minutes of computer
time. The conditional probability of entrance computations
used roughly two minutes of computer time to obtain
the above decomposition. Symmetry considerations could have
reduced the running times to 12 minutes for the example
above.


Storage requirements were as follows for the above
problem:

Strategies (probability distributions
as floating point), One Stage Reachable
Set (integer packing), and Simulation
Routine with Reinforcement Rule Logic      100,000 bytes
                                           (4 bytes per word)

Conditional Probability of Entrance
Computations Using Markov Chain Model      120,000 bytes


The computer utilized has a 500,000-byte core capacity.


A  Second  Computational  Procedure 

In this section, we briefly describe a procedure similar
to that used by Isaacs (Ref. 1) in solving the discrete
chauffeur game, and apply it to the discrete time-state dogfight
model. This procedure has special merit in this perfect
information, alternating move model in that the decomposition
of game initial conditions in accordance with the
player preference orderings is accomplished with minimal
computational expense.

The procedure is as follows:

1)  Given termination data C_{I,0}, C_{II,0}, S_0 (the
numerical subscript refers to the number of moves by I
to termination in C_I, C_II, S). Given preference
ordering for outcomes for individual players.

2)  Initialize array

                          Control
    State      u_1   u_2   u_3   v_1   v_2   v_3
    x_i

3)  Select x_i ∉ C_{I,0} ∪ C_{II,0} ∪ S_0

    a)  For u_1, with x the state reached from x_i under u_1:
        if x ∈ C_{I,0},   set (x_i, u_1) = 1.0 in array
        if x ∈ C_{II,0},  set (x_i, u_1) = 0 in array
        if x ∈ S_0,       set (x_i, u_1) = 0.3 in array
        if x ∈ D,         set (x_i, u_1) = 0.7 in array

    b)  Do a) over all u_j

    c)  For x_i
        (1)  if ∃ at least one u_j = 1.0 in array
             for that row, call x_i ∈ C_{I,1}
        (2)  if ∃ no u_j = 1.0 and at least one
             u_j = 0.7, x_i is not labeled
        (3)  if ∃ no u_j = 1.0 and no u_j = 0.7,
             and at least one u_j = 0.3, call x_i ∈ S_1
        (4)  if ∃ no u_j = 1.0, no u_j = 0.7,
             and no u_j = 0.3, call x_i ∈ C_{II,1}

4)  Do step 3) over predetermined range of
    x_i ∉ C_{I,0} ∪ C_{II,0} ∪ S_0

5)  Select x_k ∉ C_{I,0} ∪ C_{II,0} ∪ S_0 ∪ C_{I,1} ∪ C_{II,1} ∪ S_1,
    with x_l the state reached from x_k under u_1:
        if x_l ∈ C_{II,0} ∪ C_{II,1}, set (x_k, u_1) = 0; go to 5 d)
        if x_l ∈ S_0 ∪ S_1, set (x_k, u_1) = 0.3; go to 5 d)

    a)  For u_1, v_1, with x_k → x_l under u_1 and x_l → x_m under v_1:
        (1)  if x_m ∈ C_{I,0} ∪ C_{I,1},   set (x_l, v_1) = 0 in array
        (2)  if x_m ∈ C_{II,0} ∪ C_{II,1}, set (x_l, v_1) = 1 in array
        (3)  if x_m ∈ S_0 ∪ S_1,           set (x_l, v_1) = 0.3 in array
        (4)  if x_m ∈ D,                   set (x_l, v_1) = 0.7 in array

    b)  Do a) over v_j

    c)  For x_k, u_1
        (1)  if x_m ∈ C_{I,0} ∪ C_{I,1} for all v_j,
             set (x_k, u_1) = 1.0
        (2)  if x_m ∈ C_{II,0} ∪ C_{II,1} for at least one v_j,
             set (x_k, u_1) = 0
        (3)  if x_m ∈ D for at least one v_j and
             x_m ∉ C_{II,0} ∪ C_{II,1} for any v_j,
             set (x_k, u_1) = 0.7
        (4)  if x_m ∈ S for at least one v_j and
             x_m ∉ C_{II,0} ∪ C_{II,1} ∪ D for any v_j,
             set (x_k, u_1) = 0.3

    d)  Do a), b), c) for u_i

    e)  For x_k
        (1)  if (x_k, u_i) = 1.0 for any entry,
             call x_k ∈ C_{I,2}
        (2)  if (x_k, u_i) = 0 for all entries,
             call x_k ∈ C_{II,2}
        (3)  if (x_k, u_i) ≠ 1.0 for any u_i, and
             (x_k, u_i) = 0.7 for at least one u_i,
             call x_k ∈ D
        (4)  if (x_k, u_i) ≠ 1.0 and (x_k, u_i) ≠ 0.7 for
             any u_i, and (x_k, u_i) = 0.3 for at least
             one u_i, set x_k ∈ S_2

6)  Do step 5) over predetermined range of
    x_k ∉ C_{I,0} ∪ C_{II,0} ∪ S_0 ∪ C_{I,1} ∪ C_{II,1} ∪ S_1

7)  The extension to 3 and more stages of play using
    steps 5) and 6) is straightforward.
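The labeling logic of steps 3) and 5) can be sketched compactly if each reachable state is summarized by its outcome label. This is an illustrative simplification, not the original program: the function names are hypothetical, the one-step reachable sets are passed in as plain lists of labels, and a one-move state whose best entry is 0.7 is returned as "D" here, whereas step 3 c) (2) leaves it unlabeled at that stage.

```python
# Array values from the text, seen from Player I's side.
SCORE_I = {"CI": 1.0, "D": 0.7, "S": 0.3, "CII": 0.0}
# Player II's example preference ordering, best outcome first.
RANK_II = ["CII", "D", "S", "CI"]


def best_for_I(outcomes):
    """Outcome Player I prefers among those he can force."""
    return max(outcomes, key=lambda o: SCORE_I[o])


def best_for_II(outcomes):
    """Outcome Player II prefers among those he can force."""
    return min(outcomes, key=RANK_II.index)


def classify_one_move(outcomes_by_u):
    """Step 3: outcomes_by_u[j] is the label reached under u_j."""
    return best_for_I(outcomes_by_u)


def classify_two_moves(replies_by_u):
    """Step 5: replies_by_u[j] lists the labels of x_m for each reply
    v to u_j (a single-element list if u_j itself terminates play)."""
    return best_for_I([best_for_II(replies) for replies in replies_by_u])


label_a = classify_one_move(["CII", "S", "D"])
label_b = classify_two_moves([["CI", "D", "S"], ["S", "S", "CI"]])
```

Note that Player II's choice is made by his own ordering rather than by minimizing Player I's score; the two coincide only in a zero-sum setting.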


Note: A simultaneous move version of the discrete time-state
model presented is currently under study in the Grumman
Research Department. In this case, a revision of the preference
ordering (from that assumed here) has been made to
obtain ultimately a game for which a zero-sum payoff property
is specified; that is, one of the players is required
to prefer the sacrifice outcome over the draw result. A dynamic
programming procedure is being used to conduct the
strategy synthesis, with the optimal mixed strategies in the
single-stage subgames determined by a Brown-Robinson iteration
procedure. This procedure was first outlined by Kopp
(Ref. 5) in the context of a simpler simultaneous move dogfight
game model.


CONTINUOUS-TIME, DISCRETE-REGION GAME MODEL
IN THE HORIZONTAL PLANE


The Continuous-Time "Regional" State One-On-One Aerial
Combat Model in the Horizontal Plane

The model for combat in the horizontal plane is a logical
extension of the discrete model and thus permits qualitative
comparison. Both vehicles are assumed to have constant
velocity.


System  Equations 

The  kinematic  equations  are  similar  to  those  given  by 
Isaacs  (Ref.  1)  for  the  game  of  two  cars.  The  equations  are 
written  in  terms  of  a  coordinate  system  centered  on  Player  I 
(see  Fig.  10),  and  are  given  as 

\[
\dot{x} = -\frac{V_I}{R_I}\, y\, \phi + V_{II} \sin\theta
\]

\[
\dot{y} = \frac{V_I}{R_I}\, x\, \phi - V_I + V_{II} \cos\theta
\]

\[
\dot{\theta} = -\frac{V_I}{R_I}\, \phi + \frac{V_{II}}{R_{II}}\, \psi,
\qquad -1 \le \phi,\ \psi \le 1
\]

where

V_I and V_II are the speeds of Vehicles I and II,
respectively;

φ and ψ are the control variables for I and II,
respectively (both bounded); and

R_I and R_II are the minimum turn radii of I and II,
respectively

with ρ = √(x² + y²) (range), ω (bearing), and θ (relative
heading angle between V_I and V_II).
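As a sketch of how these equations can be advanced in a simulation, the following uses a forward Euler step. The integration scheme, the step size, and the function name `two_car_step` are assumptions made for illustration; the x-dependent term in the second equation follows the Isaacs two-car form cited above.

```python
import math


def two_car_step(x, y, theta, phi, psi, v1, v2, r1, r2, dt):
    """One forward Euler step of the relative kinematics in the
    coordinate system centered on Player I (controls phi, psi in [-1, 1])."""
    dx = -(v1 / r1) * y * phi + v2 * math.sin(theta)
    dy = (v1 / r1) * x * phi - v1 + v2 * math.cos(theta)
    dtheta = -(v1 / r1) * phi + (v2 / r2) * psi
    return x + dt * dx, y + dt * dy, theta + dt * dtheta


# Player II dead ahead at 5000 ft, same heading, neither player turning:
# the range should close at V_I - V_II.
state = (0.0, 5000.0, 0.0)
for _ in range(100):  # 100 steps of 0.01 sec = 1 sec of play
    state = two_car_step(*state, phi=0.0, psi=0.0,
                         v1=1000.0, v2=500.0, r1=3000.0, r2=2500.0, dt=0.01)
```

In this straight-and-level check the derivatives are constant, so the Euler step introduces no truncation error; with nonzero controls a smaller step or a higher-order scheme may be needed.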


Observable  Data  and  Control  Variables 

Since we are interested in constructing feedback controls,
φ(ρ,ω,θ) and ψ(ρ,ω,θ), let us look at a proposed
decomposition of the visual sphere (or circle and rays in
this two dimensional version). Based on discussions with
experienced combat pilots, we do not believe that relative
range, bearing, or heading can be measured accurately in the
dogfight encounter. Thus, the state of one aircraft with
respect to the other is imperfectly known. To model this
imperfect information, we ascertained in a cursory way what
is capable of being known and to what degree of accuracy.
These discussions led to the partitioning of the visual
sphere (or visual horizontal plane in this two dimensional
version) as shown in Fig. 11. This partitioning is made
with the assumption that Systems I and II are representative
of aircraft in the dogfighting situation. A division
such as Region 41 in Fig. 11 is meant to imply
that Player I can only discern that Player II is somewhere
between 6000 and 12000 feet ahead and somewhere between
0° and 7½° off to his right. In the partitioning shown
in Fig. 11, the shaded region denotes the lethal gun envelope
of I and the region in which I uses a gunsight for
lead-pursuit tracking. We have assumed that a lingering
time of 0.5 seconds continuously or 1.0 cumulative seconds
in the gun envelope constitutes a "kill"; this is a modification
of the instantaneous "kill" property of the discrete
game. The second player is assumed to have a similar
partitioning of the space.

The partitioned state space in ρ and ω has a third
coordinate, θ, which we are assuming again to be imperfect.
We assume also that θ is known only to lie within the
values specified below for Regions 1-41 and that it is not
discernible for ρ > 12,000 feet, wherein a vehicle would
appear at best as a black dot on the horizon. Hence, θ is
observable within the following:

315° < θ ≤ 45°      θ_1

 45° < θ ≤ 135°     θ_2

135° < θ ≤ 225°     θ_3

225° < θ ≤ 315°     θ_4

A similar breakdown applies to Player II. Hence, in this
model we have

41 × 4 = 164,   164 + 11 = 175 regions in the decomposition.

We have limited the admissible controls to be finite in number
(i.e., φ = ±1, 0, and similarly ψ = ±1, 0); hence,
the probabilistic feedback law would be represented by the
following table of state doubles X_i(R,θ), where R is the
region and θ is the relative heading angle between I and
II.


For Player I

State                        Prob φ = +1   Prob φ = -1   Prob φ = 0

x_1   (R = 1,  θ = θ_1)
x_2   (R = 1,  θ = θ_2)
  .
  .
x_164 (R = 41, θ = θ_4)
x_165 (R = 42, all θ)
  .
  .
x_175 (R = 52, all θ)


where, for example, the entry for x_164 under φ = -1 is the
probability of choosing the control -1 when Player I discerns
that Player II is in state x_164 with respect to himself.

For  Player  II,  we  have  a  similar  table  with  the  states  given 
by  the  proximity  of  Player  I  with  respect  to  Player  II. 

State

x_1   (R = 1,  θ = θ_1)
  .
  .
x_175 (R = 52, all θ)
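The indexing implied by these tables can be written out explicitly. The sketch below assumes that states are numbered region by region with the four heading sectors in order, as the rows x_1, x_2, x_164, x_165, and x_175 suggest; the function names are hypothetical.

```python
def theta_sector(theta_deg):
    """Heading sector 1..4; sector 1 wraps around 0 (315 < theta <= 45)."""
    t = theta_deg % 360.0
    if t > 315.0 or t <= 45.0:
        return 1
    if t <= 135.0:
        return 2
    if t <= 225.0:
        return 3
    return 4


def state_index(region, theta_deg=None):
    """Map (region, heading) to the table row 1..175.

    Regions 1-41 are split into four heading sectors (164 rows);
    Regions 42-52 lie beyond 12,000 ft, where heading is not
    discernible, and contribute one row each (11 rows)."""
    if 1 <= region <= 41:
        return 4 * (region - 1) + theta_sector(theta_deg)
    if 42 <= region <= 52:
        return 164 + (region - 41)
    raise ValueError("region must lie in 1..52")


first_row = state_index(1, 10.0)
last_near_row = state_index(41, 300.0)
last_row = state_index(52)
```

A feedback table for either player is then a 175-row array with three probabilities per row.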


The sets of capture C_I, C_II, and sacrifice S cannot necessarily
be identified in terms of ρ, ω, θ at the outset, even
though one may be in the envelope of the other, due to the
linger time stipulation.
Simulation Procedure


Assume that a family of games is played with durations
0 < T_1 < T_2 < ... < T_n (see Fig. 12). Assume that the game
begins at initial conditions ξ (say in Region 23 for I, which
corresponds to Region 52 for II) and has duration T_1. [We select
initial conditions close to termination for I (and II).]

Choices of control are selected from X(R = 23, θ = θ_1)
for Player I and X(R = 52, all θ) for Player II. Say, for
argument's sake, that they are φ = +1 and ψ = +1, respectively.
The differential equations are integrated from ξ
using φ = +1 and ψ = +1 until Region 23 for I or Region
52 for II is exited. If either occurs (or both), the new
region for that player is consulted (say 23 → 22 for I,
while II continues with 52); hence X(R = 22, θ = θ_1) is
consulted for the next control decision for I which, say, is
φ = 0. We continue in this way until an outcome C_I, C_II,
S, or T_1 is observed. Meanwhile, the "state-control"
pairs have been temporarily stored. For

I     23, φ = +1 ; 22, φ = 0 ; ...

II    52, ψ = +1 ; ...

As in the discrete model, the reinforcement rule is applied
to alter the distributions with respect to the temporarily
stored data. We have modified the reinforcement rule so that
it is no longer a multiplication of the control choice chain
of any one run by a constant, followed by normalization. We
have incorporated a linear weighting that reinforces the control
choice chain more strongly after many plays of the game,
in the hope of avoiding the reinforcement of a basically poor
choice of control that may have led to a successful outcome
on the part of one player only because the second had not yet
learned how to play adequately. We repeat this procedure
over many ξ in regions close to termination using time
parameter T_1; T_2 is then selected, and experiments are
repeated over ξ in regions not previously covered by experiments
using T_1.
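The region-triggered control reevaluation described above can be sketched as follows. Everything here is a simplified illustration: the toy one-dimensional dynamics, the region functions, and the names `play` and `sample` are assumptions, and the capture and linger-time tests of the actual model are omitted so that play simply runs to the time limit.

```python
import random

CONTROLS = (+1, -1, 0)  # column order of the feedback table


def sample(probabilities, rng):
    """Draw one control from a table row (Prob +1, Prob -1, Prob 0)."""
    return rng.choices(CONTROLS, weights=probabilities, k=1)[0]


def play(state, step, region_I, region_II, table_I, table_II, t_max, dt, rng):
    """Integrate until t_max, re-sampling a player's control only when
    the region that player observes changes. Returns the stored
    (region, control) chains for Players I and II."""
    reg_I, reg_II = region_I(state), region_II(state)
    phi, psi = sample(table_I[reg_I], rng), sample(table_II[reg_II], rng)
    chain_I, chain_II = [(reg_I, phi)], [(reg_II, psi)]
    t = 0.0
    while t < t_max:
        state = step(state, phi, psi, dt)
        t += dt
        new_I, new_II = region_I(state), region_II(state)
        if new_I != reg_I:  # Player I crossed a regional boundary
            reg_I = new_I
            phi = sample(table_I[reg_I], rng)
            chain_I.append((reg_I, phi))
        if new_II != reg_II:  # Player II crossed a regional boundary
            reg_II = new_II
            psi = sample(table_II[reg_II], rng)
            chain_II.append((reg_II, psi))
    return chain_I, chain_II


# Toy example: a scalar "state" that drifts with Player I's control.
chain_I, chain_II = play(
    state=0.5,
    step=lambda x, phi, psi, dt: x + phi * dt,
    region_I=lambda x: int(x),
    region_II=lambda x: 0,
    table_I={0: (1, 0, 0), 1: (0, 0, 1)},
    table_II={0: (0, 0, 1)},
    t_max=2.0, dt=0.25, rng=random.Random(0),
)
```

The two chains returned are the temporarily stored data to which the reinforcement rule would then be applied.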


Note: In this model we do not have to decide whether the
game is of simultaneous or alternating move structure; the
sequence of moves in time resolves itself in accordance with
the assumed decomposition among the observables and the
integration of the kinematic equations. It should also be
noted that we have used the y-axis as a reflecting barrier
and thereby, by symmetry, have reduced the number of stored
states in our feedback representation, and subsequently in
our simulation.


Preliminary  Computational  Results 

The results presented for the continuous 2-D model are
by no means complete, but these results do indicate that the
reinforcement algorithm developed for the discrete game carries
over directly to the continuous one.

Region 22 (θ_1) as shown in Fig. 13 (and designated
simply as 22 in Fig. 11) is considered representative of a
region close to termination. We are seeking to ascertain
the control policy probability distributions on the part of
both players for encounters that begin therein. We are also
seeking the probability of the various possible outcomes,
C_I, C_II, S, and D. We fix the converged control policies
for Region 22 (θ_1) and, knowing the probabilistic outcomes
for play entering that region, go on to consider Region
22 (θ_2). We start play in the latter region and terminate
play if we enter Region 22 (θ_1), which has been
previously decided, or terminate by the occurrence of one of
the possible outcomes prior to entering Region 22 (θ_1). We
reinforce accordingly, and begin new encounters until the
control choice probability distribution becomes invariant
for Region 22 (θ_2).

The particular parameters that were chosen in this 2-D continuous
model were V_I = 1000 ft/sec, V_II = 500 ft/sec, R_I = 3000 ft,
and R_II = 2500 ft. Investigation of the time that any one play
from a given initial condition lasts, before a draw is considered
the outcome, resulted in a time of 100 seconds. At a relative
velocity between the two players of 500 ft/sec this time is
sufficient for the faster player to catch the slower if the
slower is near the edge of the visual threshold, as shown in
Fig. 11, and headed in the same direction.
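As a quick check of the time limit, the arithmetic for a pure tail chase can be written out. The sketch assumes both players fly straight in the same direction for the whole play, which is the case the text describes; the variable names are illustrative only.

```python
v_fast = 1000.0   # ft/sec, speed of the faster player quoted in the text
v_slow = 500.0    # ft/sec, speed of the slower player quoted in the text
t_limit = 100.0   # sec, play length after which a draw is declared

closing_speed = v_fast - v_slow            # 500 ft/sec relative velocity
max_gap_closed = closing_speed * t_limit   # separation that can be made up
print(max_gap_closed)  # -> 50000.0 (ft)
```

Any initial separation up to this distance can be closed before the draw limit is reached, under the straight-line assumption above.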

The primary question to which we addressed ourselves was: What is
the most favorable probability distribution on the choice of
control decisions for Player I when he finds Player II in Region
22 (θ_1)? Note that even though II is always in Region 22 (θ_1)
with respect to I, I is not necessarily in the same region with
respect to Player II at the beginning of play. We utilize 80
particular sets of initial conditions; these are specified as all
combinations of ρ = 3600 ft, 4200 ft, 4800 ft, 5400 ft;
ω = 1.5°, 3.0°, 4.5°, 6.0°; and θ = -44°, -22.5°, 0°, 22.5°, 44°,
all of which fall into Region 22 (θ_1) of II/I (Player II with
respect to Player I).
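The 80 initial conditions are simply the Cartesian product of the three value lists (4 x 4 x 5). A minimal sketch of the enumeration follows; the variable names `rho_ft`, `omega_deg`, and `theta_deg` are labels chosen here for the three listed quantities, and no particular physical interpretation of them is asserted.

```python
from itertools import product

rho_ft = [3600, 4200, 4800, 5400]
omega_deg = [1.5, 3.0, 4.5, 6.0]
theta_deg = [-44.0, -22.5, 0.0, 22.5, 44.0]

# Every combination of the three lists, in a fixed, reproducible order.
initial_conditions = list(product(rho_ft, omega_deg, theta_deg))
print(len(initial_conditions))  # -> 80
```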


We begin by dividing the unit interval equally into 80 parts,
with each part corresponding to one of the 80 (ρ, ω, θ) triples
(initial conditions). Starting from a uniform distribution on the
control policies of Player I for Region 22 (θ_1), we select an
initial condition randomly, run a game, observe the outcome, make
the reinforcement accordingly, and choose another initial
condition; then another game is run, and so on. This resulted in
a single distribution for the region, which was p_LT = 1.0,
p_SA = 0, and p_RT = 0, where p_LT = probability of making a Left
Turn, p_SA = probability of going Straight Ahead, and p_RT =
probability of making a Right Turn. The results of running 1000
random initial conditions chosen from the 80 allowable yielded
p_CI = 0.885, p_CII = 0.030, p_S = 0, and p_D = 0.085. Many of
the draw outcomes and captures by Player II occurred during the
first few hundred games. If we look at games 500 through 1000,
then p_CI = 0.940, p_CII = 0.020, p_S = 0, and p_D = 0.040, which
looks very good for Player I. One might conjecture that a left
turn when the opponent is ahead and slightly to the right is not
the best policy; but after tracing a few of the plays through,
one sees that Player I turns left as a delaying maneuver and then
right (II/I is in Region 23 (θ_1) or in Region 23 (θ_2) as he
turns right) since he has a closing velocity of 500 ft/sec. If he
had gone straight, Player II would have turned left and could
have held I in the weapons envelope as he passed II. If he had
turned right, Player II could have made a much sharper right turn
and obtained a draw. Using a different random number generator
for selecting the initial conditions and the control choices led
to p_CI = 0.940, p_CII = 0.009, p_S = 0, and p_D = 0.051, but the
control policy for Player I converged to p_LT = 0, p_SA = 1.0,
p_RT = 0, which tends to indicate that making a left turn and
going straight ahead are equally good policies for Player I and
result in a high probability of capture. Player II's control
policy choice for the initial condition at the end of 1000 games
was virtually a uniform distribution in both cases, indicating
that all choices of control on his part were equally bad due to
his being beaten so many times. Other regions also converged
during these 1000 runs, such as Regions 23 (θ_1) and 23 (θ_2),
which converged to p_LT = 0, p_SA = 0, and p_RT = 1.0.
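The sampling step and the outcome tally in this experiment can be sketched as below. This is a hedged illustration rather than the original program: `pick_index` and `outcome_frequencies` are hypothetical helper names, the outcome labels `"CI"`, `"CII"`, `"S"`, `"D"` are shorthand chosen here, and the game simulation itself is not reproduced.

```python
import random
from collections import Counter


def pick_index(u, n_parts=80):
    """Map a number u in [0, 1] to one of n_parts equal subintervals."""
    # min() guards the edge case u == 1.0, which would otherwise overflow.
    return min(int(u * n_parts), n_parts - 1)


def outcome_frequencies(outcomes):
    """Relative frequency of each outcome label over a list of games."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {label: counts[label] / total for label in ("CI", "CII", "S", "D")}


# Selecting an initial condition for one game from a uniform random number.
rng = random.Random(0)  # the seed is arbitrary
index = pick_index(rng.random())

# Tallying a list of recorded outcomes; the counts below are the ones
# implied by the first 1000-game run reported in the text.
recorded = ["CI"] * 885 + ["CII"] * 30 + ["D"] * 85
freqs = outcome_frequencies(recorded)
```

The tally reproduces the reported frequencies only because the counts are taken from the text; it is a bookkeeping check, not a rerun of the experiment.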

The procedure at this point is to take the resulting distribution
tables for each player and start play in an adjacent region such
as 22 (θ_3) [since Region 22 (θ_2) had converged to p_LT = 1.0,
p_SA = 0, p_RT = 0 in the prior run] and allow the distributions
to change. Note that those regions for which the probability
distribution on the controls has gone to 1, 0, 0 can never be
altered by this algorithm. We can also terminate play when one of
those regions, such as Region 22 (θ_1), from which we have
already simulated play, is entered, since we already know the
outcome of play that begins in that region.
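The remark that a 1, 0, 0 distribution can never be altered follows directly if reinforcement scales a chosen probability by a factor and then renormalizes: a zero entry stays zero under any scaling. The sketch below assumes that multiply-and-normalize form; the function name and the factor value are illustrative.

```python
def multiply_and_normalize(probs, control, factor):
    """Scale the probability of one control by `factor`, then renormalize."""
    scaled = list(probs)
    scaled[control] *= factor
    total = sum(scaled)
    return [p / total for p in scaled]


converged = [1.0, 0.0, 0.0]  # p_LT, p_SA, p_RT

# Reinforcing the already certain control changes nothing.
after_lt = multiply_and_normalize(converged, control=0, factor=5.0)
# Reinforcing a zero-probability control also changes nothing: 0 * 5 = 0.
after_rt = multiply_and_normalize(converged, control=2, factor=5.0)
print(after_lt, after_rt)  # -> [1.0, 0.0, 0.0] [1.0, 0.0, 0.0]
```

In practice a zero-probability control would never be sampled in the first place, so such a region is fixed on both counts.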


CONCLUSIONS  AND  DIRECTIONS  FOR  FUTURE  WORK 

It is clear at this point that the general solvability of
realistic one-on-one dogfight game models is far from being an
accomplished fact. In reality, it is not clear at present that
any single computational approach today would have the requisite
efficiency and capacity to handle the variety of detailed game
models in which veteran combat pilots might place an ultimate
faith. Despite this, there is a great deal of information of a
general nature that can be gained with these simple models. For
example, obtaining the decompositions of the game initial
conditions in a systematic way can lead to parametric studies
involving: 1) vehicle parameters; 2) weapon systems parameters;
3) observable data changes; and 4) player preference ordering
changes, etc. In this way, the improved capability due to a
change in a vehicle-weapons system can be directly measured by
the "volume" increase of the space of initial conditions for
which that system has unilateral capture capability or, as might
be the case with improvements in the observable data, by
improvements in the capture probabilities as well. The associated
strategies for attaining these decompositions would also be
obtained when making these studies.

An additional use for such simple models and their resolution may
be to provide the more complex and extremely detailed digital
simulation efforts with the approximate location of the
boundaries making up the initial condition decomposition and the
associated strategies. The computational method presented here
was utilized in a simplified form, and although the results
sought were obtained, the algorithm as applied in these game
models is computationally inefficient. Efforts are underway to
devise better sampling procedures and more sophisticated
reinforcement learning rules in these models.
Fig. 1 Lethal Envelope Player I


Fig.  3  Relative  Heading 


Figure 7. Decomposition of Starting Conditions for All N, M, with P = +1


Figure  9.  Decomposition  of  Starting  Conditions  for  All  N,  M,  with  P  =  3 